[PySpark] - Add mapInPandas and mapInArrow methods to DataFrame class#325

Open
mariotaddeucci wants to merge 3 commits into duckdb:main from mariotaddeucci:feature/pyspark-dataframe-map-in-functions

Conversation

@mariotaddeucci
Contributor

No description provided.

@mariotaddeucci
Contributor Author

Hey @evertlammerts, will the main branch be updated, or should I point this PR to another branch?

@evertlammerts
Collaborator

Hey @mariotaddeucci, there's a merge PR up at #351. As soon as that works you can rebase this.

Copilot AI review requested due to automatic review settings March 19, 2026 01:23
Contributor

Copilot AI left a comment


Pull request overview

Adds PySpark-compatible DataFrame.mapInPandas and DataFrame.mapInArrow APIs to DuckDB’s experimental Spark DataFrame implementation, along with typing support and tests.

Changes:

  • Implement mapInArrow (Arrow RecordBatch iterator in/out) and mapInPandas (pandas DataFrame iterator in/out) on DataFrame.
  • Add iterator-function typing aliases for Pandas/Arrow mapping functions.
  • Add fast tests covering basic behavior, empty results, and a “no data loss” scenario; update DuckDB submodule revision.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

  • tests/fast/spark/test_spark_dataframe_map_in.py: Adds tests for mapInPandas/mapInArrow, including empty output and large dataset validation.
  • duckdb/experimental/spark/sql/dataframe.py: Implements mapInArrow and mapInPandas methods on DataFrame with docstrings and limited feature support.
  • duckdb/experimental/spark/_typing.py: Introduces PandasMapIterFunction and ArrowMapIterFunction type aliases.
  • external/duckdb: Bumps DuckDB submodule commit to pick up required functionality.


Comment on lines +1517 to +1518
ds = dataset(reader) # noqa: F841
df = DataFrame(self.session.conn.sql("SELECT * FROM ds"), self.session)
Comment on lines +1548 to +1550
schema : :class:`pyspark.sql.types.DataType` or str
the return type of the `func` in PySpark. The value can be either a
:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
Comment on lines +1581 to +1584
>>> def mean_age(iterator):
... for pdf in iterator:
... yield pdf.groupby("id").mean().reset_index()
>>> df.mapInPandas(mean_age, "id: bigint, age: double").show()
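The docstring's `mean_age` generator can be exercised with plain pandas to show what the function does to each partition before DuckDB stitches the results together. A minimal sketch, with the single-chunk input invented for illustration:

```python
import pandas as pd

def mean_age(iterator):
    # Same generator shape as the docstring example: consume an iterator
    # of pandas DataFrames, yield one aggregated DataFrame per chunk.
    for pdf in iterator:
        yield pdf.groupby("id").mean().reset_index()

chunks = [pd.DataFrame({"id": [1, 1, 2], "age": [30.0, 40.0, 50.0]})]
result = pd.concat(mean_age(iter(chunks)), ignore_index=True)
print(result.to_dict("list"))  # {'id': [1, 2], 'age': [35.0, 50.0]}
```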
Comment on lines +22 to +24
import pyarrow
from numpy import float32, float64, int32, int64, ndarray
from pandas import DataFrame as PandasDataFrame
Comment on lines +35 to +39
DataFrameLike = PandasDataFrame

PandasMapIterFunction = Callable[[Iterable[DataFrameLike]], Iterable[DataFrameLike]]

ArrowMapIterFunction = Callable[[Iterable[pyarrow.RecordBatch]], Iterable[pyarrow.RecordBatch]]
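These aliases just name the iterator-to-iterator callable shape; any generator function over DataFrames satisfies `PandasMapIterFunction`. A small sketch of a conforming function (the alias definitions are reproduced from the snippet above; `add_one` is hypothetical):

```python
from typing import Callable, Iterable
import pandas as pd

DataFrameLike = pd.DataFrame
PandasMapIterFunction = Callable[[Iterable[DataFrameLike]], Iterable[DataFrameLike]]

def add_one(frames: Iterable[DataFrameLike]) -> Iterable[DataFrameLike]:
    # Conforms to PandasMapIterFunction: DataFrames in, DataFrames out.
    for f in frames:
        yield f.assign(x=f["x"] + 1)

fn: PandasMapIterFunction = add_one
out = next(iter(fn([pd.DataFrame({"x": [1, 2]})])))
print(out["x"].tolist())  # [2, 3]
```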
Comment on lines +50 to +59
n = 10_000_000

pandas_df = pd.DataFrame(
    {
        "id": np.arange(n, dtype=np.int64),
        "value_float": np.random.rand(n).astype(np.float32),
        "value_int": np.random.randint(0, 1000, size=n, dtype=np.int32),
        "category": np.random.randint(0, 10, size=n, dtype=np.int8),
    }
)
Comment on lines +66 to +67
generated_pandas_df = df.toPandas()
total_records = df.count()

assert total_records == n
assert pandas_df["id"].equals(generated_pandas_df["id"])
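The "no data loss" check above can be reproduced at a smaller scale without DuckDB: push a frame through an identity map function and verify row count and the `id` column survive intact. A sketch with `n` reduced from the test's 10_000_000 (column names mirror the snippet; `identity` is invented here):

```python
import numpy as np
import pandas as pd

n = 1_000
rng = np.random.default_rng(0)
pandas_df = pd.DataFrame(
    {
        "id": np.arange(n, dtype=np.int64),
        "value_float": rng.random(n).astype(np.float32),
    }
)

def identity(iterator):
    # A pass-through map function must preserve every row unchanged.
    for pdf in iterator:
        yield pdf

result = pd.concat(identity(iter([pandas_df])), ignore_index=True)
print(len(result))  # 1000
```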


3 participants